Abstract

Grammars describe the structure of sentences by means of
sets of rules whose form depends on the grammar type. Noam
Chomsky classified grammars into four well-known types. Yet
there are other types of grammars that do not exactly fit
Chomsky's classification, such as two-level grammars. As
their name suggests, the main idea behind these grammars is
that they are composed of two grammars.

Van Wijngaarden grammars are one such kind of grammar. They
are interesting for their power (expressiveness), which,
under some hypotheses, can equal that of the most powerful
grammars of Chomsky's classification, i.e. Type 0 grammars.
Another point of interest is their relative conciseness and
readability.

Van Wijngaarden grammars can describe the static and dynamic
semantics of a language. By using them as a generative
engine, it is therefore possible to generate a possibly
infinite set of words while guaranteeing that they all have
the same semantics. Moreover, they can describe K-ary codes,
by describing the semantics of each component of a code.

1 Introduction 

Grammars are mostly used to describe languages, such as
programming languages, in order to parse them. In this
paper, we are not interested in the parsing problem of a
language. On the contrary, the objective is to use a grammar
for which the word parsing problem is known to be hard (as
in NP) or, even better, undecidable. Indeed, if one wants to
achieve metamorphism through the use of a grammar, one may
want to avoid grammars for which techniques to build
practical word recognizers are known.

Van Wijngaarden grammars differ from the grammars that fall
within Chomsky's classification. Their notation is
particular and, above all, their production process is quite
different from that of the grammars in Chomsky's hierarchy.
We will see that these grammars may be used as "code
translators": some of their rules make them very
expressive.

2 Metamorphism vs. Polymorphism

The difference between polymorphism and metamorphism is
often unclear, so we describe it briefly in this section.

2.1 Polymorphism 

Polymorphism first appeared to counter the detection scheme
of AV companies, which was, and for the most part still is,
based on signature matching. The aim of virus writers was to
write a virus whose signature would change each time it
evolves. To do so, the virus body is encrypted by an
encryption function and decrypted by its decryptor at
runtime. The key used to encrypt each copy of the virus is
changed, so that each copy has a different body (Figure 1).
Another technique is to apply a different encryption scheme
to each copy of the code. Of course, such a technique alone
is not enough to evade signature detection, as it only
shifts the problem: the decryptor is a good candidate for a
signature. To resolve this, the decryption routine has to
change too between each copy of the virus. To do so, virus
writers include a mutation engine, which is also encrypted
during the propagation process, and which is used to
randomly generate a new decryption routine so that it
differs from copy to copy (Figure 2). While in the first
case the decryption routine can be used as a signature, this
is not the case in the second one, since the decryption
routine changes from mutation to mutation thanks to the
engine. The mutation engine cannot be used as a signature
either, because it is part of the body and is thus
encrypted.

Figure 1: Two files infected by the same virus.

Figure 2: Two files infected by the same virus.

The propagation process can be summed up in five steps:

• The decryption routine decrypts the encrypted 
body; 

• The body is executed; 

• The code calls the mutation engine (which is de- 
crypted at this stage) to transform the decryption 
routine; 

• The code and the mutation engine are encrypted; 

• The transformed decryption routine and the new 
encrypted body are then appended onto a new pro- 
gram. 

2.2 Metamorphism 

Metamorphism differs from polymorphism in that no decryption
routine is used, because there is no encryption process. In
other words, while a polymorphic code has to decrypt itself
before it can be executed, a metamorphic one is executed
directly.

Indeed, a metamorphic engine can be seen as a "semantic
translator". The idea is to rewrite a given code into
another, syntactically different yet semantically equivalent
one (Figure 3).

Figure 3: Four equivalent codes.

Different techniques can be used to build an efficient
metamorphic engine. Among these techniques are:

• Junk code insertion: junk code is code that the main code
does not need in order to perform its task.

• Variable renaming: the variables used differ between
versions of the code.

• Control flow modifications: instructions that are
independent of each other can be swapped. Otherwise,
instructions can be shuffled and linked by jumps.

3 Grammars 

In this section, we recall what formal grammars are and 
the link they have with languages. 

3.1 What is a grammar 

Definition 1. Let $\Sigma$ be a finite set of symbols called
the alphabet. A formal grammar G is defined by the 4-tuple
$G = (V_N, V_T, S, P)$ where:

• $V_N$ is a finite set of non-terminal symbols,
$V_N \cap \Sigma^* = \emptyset$;

• $V_T$ is a finite set of terminal symbols,
$V_N \cap V_T = \emptyset$;

• $S \in V_N$ is the starting symbol of the grammar;

• $P \subseteq (V_T \cup V_N)^* \times (V_T \cup V_N)^*$ is a
set of production rules.

Basically, a grammar can be seen as a set of rewriting rules
over an alphabet. An alphabet is a finite set of symbols
(like 'a', 'b'). We distinguish two sets of symbols: the set
of non-terminal symbols and the set of terminal symbols.
Non-terminal symbols are meant to be replaced by the
right-hand side of a production rule. Terminal symbols, on
the contrary, cannot be modified by a rule. Of course, the
two sets are disjoint. A rewriting rule defines how a given
sequence of symbols can be rewritten into another sequence
of symbols. A special symbol, called the start symbol,
specifies where the rewriting must start; it belongs to the
set of non-terminal symbols. We write G = (N, T, S, P) for
the grammar G composed of the set of non-terminal symbols N,
the set of terminal symbols T, the starting symbol S, and
the set of production (rewriting) rules P.

Definition 2. Let $G = (V_N, V_T, S, P)$ be a formal
grammar. The language described by G is
$L(G) = \{x \in \Sigma^* \mid S \Rightarrow^* x\}$.

Grammars are used to describe languages. A language is a set
of words, each word being a sequence of symbols. A word need
not have a meaning or a structure. For instance, the grammar
G = ({S}, {a}, S, {S -> aS; S -> a}) (here N = {S},
T = {a}, S = S, and P = {S -> aS; S -> a}) describes the
language $L(G) = \{a^n \mid n \geq 1\}$ (i.e. the words 'a',
'aa', 'aaa', ...). Grammars can be represented in different
forms. For convenience, we will write the production rules
of a grammar as follows:

S -> aS 
S -> a 

When several rules share the same left-hand side, as is the
case here, we can merge the alternatives into one rule,
separated by a '|':

S -> aS | a 

To generate a word from these rules, one proceeds as
follows: start from the starting symbol and replace it by
one of its alternatives. Then two cases have to be
considered:

- either a sequence of symbols of the produced sentential
form matches the left-hand side of a rule;

- or it does not and, if the sentential form contains no
non-terminal symbol, it is a word of the language described
by the grammar.

Whenever a sequence of symbols matches the left-hand side of
a rule, it is replaced by one of the alternatives of the
rule, and the process goes on until no more matches are
found.

As an example, take the above rule. The starting word is
'S'. Suppose that 'S' produces the sentential form 'a'. As
'a' does not match the left-hand side of any rule at our
disposal, and as it is a terminal symbol, it is a word of
the language. Now suppose that 'S' produces the sentential
form 'aS'. The non-terminal 'S' in 'aS' matches the
left-hand side of one of the rules, so we replace it by one
of its alternatives: 'a' or 'aS'. We thus obtain the
sentential forms 'aa' or 'aaS'. Hence, the words generated
by the above rule are: a, aa, aaa, aaaa, ...

Now, if we use x86 instructions as the terminal symbols, we
can write rules which generate x86 instruction sequences
[Fil07b, Zbi09]. From a given sequence of instructions, it
is easy to write a grammar which generates it. For instance,
the instruction sequence:

mov eax, key 
xor [ ebx ], eax 
inc ebx 

can be generated by the following production rules : 

S -> mov eax, key T
T -> xor [ ebx ], eax U
U -> inc ebx V

The instruction sequence is thus represented by the sequence
of non-terminal symbols S -> T -> U -> V, the non-terminal S
being rewritten into the sentential form "mov eax, key T",
which is then rewritten into the sentential form
"mov eax, key xor [ ebx ], eax U", etc.

Once the production rules are defined, one may want 
to generate an equivalent sequence of instructions. It is 
rather easy : 

S -> mov eax, key T | push key; pop eax T
T -> xor [ ebx ], eax U
   | mov ecx, [ ebx ]; and ecx, eax; not ecx;
     or [ ebx ], eax; and [ ebx ], ecx U
U -> inc ebx V | add ebx, 1 V

The production rules now generate 8 (2x2x2) different
sequences, all of which behave the same. In the same manner,
one may want to add some junk code. This can be done by
adding a new non-terminal which generates "useless"
instructions (see footnote 1):

S -> G mov eax, key T | G push key; pop eax T
G -> add edx, 1; dec edx | push eax; add esp, 4

For this example, the addition of the rule G, which is
composed of only two alternatives, increases the number of
instruction sequences that can be generated to 216 (6x6x6).
This number can easily be made infinite, for example by
adding alternatives which generate only junk code, like:

S -> G S | mov eax, key T | push key; pop eax T

or else

G -> G G | add edx, 1; dec edx | push eax; add esp, 4

1 Care must be taken about where these instructions are
added, as they may modify flags which are checked later,
e.g. by a jcc instruction.
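
As a quick sanity check of the counts quoted in this section, the following Python sketch enumerates the choices along the chain S, T, U. The alternatives are opaque placeholder labels (s1, t1, g1, ... are hypothetical names, not part of the grammar). The figure of 216 is reproduced only under an assumption that the text does not state explicitly: that the junk non-terminal G is prefixed to all three rules and may also be left out, which gives 3 x 2 = 6 choices per rule.

```python
from itertools import product

def count_sequences(alternatives_per_rule):
    """Count the distinct combinations obtained by picking one
    alternative for each rule of the chain S -> T -> U."""
    return len(set(product(*alternatives_per_rule)))

# Two alternatives per rule, as in the grammar without junk code.
plain = [["s1", "s2"], ["t1", "t2"], ["u1", "u2"]]

# ASSUMPTION: an optional junk prefix (none, g1 or g2) on every rule.
junk_prefixes = ["", "g1 ", "g2 "]
with_junk = [[g + alt for g in junk_prefixes for alt in rule]
             for rule in plain]

print(count_sequences(plain))      # 2 * 2 * 2
print(count_sequences(with_junk))  # 6 * 6 * 6
```

Once a recursive alternative such as G -> G G is added, the set of derivable sequences is no longer finite, so an exhaustive count of this kind no longer terminates.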

3.2 Classification of grammars 

Chomsky provided a well-known classification of grammars
[Cho56]. He defined four main types, from Type 0 to Type 3,
each type defining a set of languages that is a subset of
the set described by any lower-numbered type. In other
words, Type 0 grammars are the most general, while Type 3
grammars are the most restrictive. Among these grammars,
Type 2 grammars, also called context-free grammars, are the
most popular. They describe context-free languages. Most
programming languages are described by such grammars. The
rules of Type 2 grammars have the following form:

U -> V

where U is a single non-terminal symbol and V belongs to
$(N \cup T)^*$.

In other words, U can be rewritten as a possibly empty
sequence of terminal and non-terminal symbols. The name
context-free comes from the fact that the left-hand side of
a rewriting rule is a single non-terminal, so the rewriting
does not depend on what may be next to it in a sentential
form, unlike in Type 0 and Type 1 grammars. We have the
relation Type 0 $\supset$ Type 1 $\supset$ Type 2 $\supset$
Type 3. Thus, Type 0 grammars can define all the languages
that are definable by Type 1, Type 2 or Type 3 grammars.
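
The rule shapes behind the hierarchy can be checked mechanically. The Python sketch below returns the most restrictive type whose rule shape every production satisfies. It is an illustration under stated assumptions: rules are given as pairs of symbol tuples, Type 3 is tested in its right-linear form, and Type 1 is tested in its non-contracting form (a standard equivalent characterisation). The function name chomsky_type is hypothetical.

```python
def chomsky_type(rules, nonterminals, start="S"):
    """Return 3, 2, 1 or 0: the most restrictive Chomsky type whose
    rule shape is satisfied by every (lhs, rhs) pair in `rules`.
    Each side of a rule is a tuple of symbols."""

    def is_nt(symbol):
        return symbol in nonterminals

    def context_free(lhs, rhs):
        # Type 2: the left-hand side is a single non-terminal.
        return len(lhs) == 1 and is_nt(lhs[0])

    def right_linear(lhs, rhs):
        # Type 3 (ASSUMED right-linear form): A -> w or A -> w B,
        # where w contains terminals only.
        if not context_free(lhs, rhs):
            return False
        body = rhs[:-1] if rhs and is_nt(rhs[-1]) else rhs
        return not any(is_nt(symbol) for symbol in body)

    def non_contracting(lhs, rhs):
        # Type 1 (ASSUMED non-contracting form): |lhs| <= |rhs|,
        # except for the optional rule start -> empty.
        if lhs == (start,) and rhs == ():
            return True
        return 1 <= len(lhs) <= len(rhs)

    shapes = ((3, right_linear), (2, context_free), (1, non_contracting))
    for level, shape in shapes:
        if all(shape(lhs, rhs) for lhs, rhs in rules):
            return level
    return 0

# S -> aS | a has right-linear rules only.
regular = [(("S",), ("a", "S")), (("S",), ("a",))]
print(chomsky_type(regular, {"S"}))
```

The function classifies a grammar by the shape of its rules, not by the language it generates: a language may well have a grammar of a more restrictive type than the one given.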

4 Van Wijngaarden grammars 
4.1 Context-sensitivity restrictions 

Context-sensitive languages are more complex than 
context free languages because one part of the string 
may "interact" with the structure of the other parts of 
the string. Once a non-terminal symbol has been pro- 
duced in a sentential form in a context-free grammar, 
its further development is independent of the rest of 
the sentential form, whereas a non-terminal symbol in 
a sentential form of a context-sensitive grammar has to 
look at its neighbours, on its left and on its right, to 
see what are the production rules that are allowed for it. 
So a context-free grammar cannot express some "long- 
range" relations. 

Yet these relations are often valuable, as they allow some
fundamental properties of words to be described (such as
using only variables that have been declared). Programming
languages are usually context-sensitive. For example, a user
is usually not allowed to use a variable that has not been
created. As such properties cannot be expressed by a
context-free grammar, the solution used most of the time is
to describe the structure of the correct words by a
context-free grammar. The properties are then checked by a
separate program after the word has been recognized by the
grammar (though it may not belong to the "real" language).
However, this solution is not very satisfactory, as the
interest of using a grammar is to have a (formal)
description of all the properties of the language.

One can ask why a context-sensitive grammar is not 
used to describe the language. Actually this would pose 
some problems. Indeed, in general, context-sensitive 
languages cannot be parsed efficiently. Moreover, even 
though context-sensitive grammars have the power to 
express some long-ranged relations in a sentential form, 
they don't do it in a way that is easily understandable. 

It would also make sense that, after having written a
grammar for $a^n b^n c^n$, writing one for
$a^n b^n c^n d^n$ would work the same way. But this is not
the case: the grammar for $a^n b^n c^n d^n$ is more complex.
The reason is that, to express a long-range relation,
information has to flow through the sentential form by means
of the non-terminal symbols (which look at their neighbours
to rewrite a sentential form into another). Thus almost
every rule has to know something about almost all the other
rules.

Several grammar forms which make these relations 
more readable and easier to construct have been created. 
Among them are Van Wijngaarden grammars. 

4.2 VW grammar definition 

Basically, a VW grammar can be seen as the compo- 
sition of two context-free grammars (that is why such 
grammars are also called two-level grammars). The first 
context-free grammar is used to generate a set of ter- 
minal symbols which will act as non-terminals for the 
second context-free grammar. 

Before going further, a few terms have to be intro- 
duced. 

• A protonotion is a sequence of small syntactic 
marks ; 

• A metanotion is a sequence of big syntactic 
marks which is defined in a metarule ; 

• A hypernotion is a possibly empty sequence of 
metanotions and protonotions ; 

• A metarule defines a metanotion as a possibly
empty sequence of hypernotions;

• A hyperrule defines a sequence of hypernotions as another
sequence of hypernotions, separated by commas. A hyperrule
actually represents a possibly infinite set of production
rules;

• A VW grammar is defined by a set of metarules (or
metaproduction rules) and a set of hyperrules;

• Whenever a metanotion appears more than once in a
hyperrule, all of its occurrences have to be replaced
consistently throughout the rule. This is called the
Uniform Replacement Rule.

Definition 3. [vWMP+77] A Van Wijngaarden grammar is a
grammar $G = (M, V, N, T, R_M, R_V, S)$ with:

M : a finite set of metanotions

V : a finite set of metaterminals, $M \cap V = \emptyset$

N : a finite set of hypernotions, $N \subset (M \cup V)^+$

T : a finite set of terminals

$R_M$ : a finite set of metarules $X \to Y$ with
$X \in M$, $Y \in (M \cup V)^*$, such that for all
$W \in M$, $(M, V, W, R_M)$ is a context-free grammar

$R_V$ : a finite set of hyperrules

$S \in N$ : the starting symbol

The first set of rules are the metarules. They rep- 
resent a modified grammar in which the non-terminals 
are replaced by metanotions, and the terminals are re- 
placed by protonotions. The second set of rules are the 
hyperrules. They represent some possibly infinite set of 
production rules. 

To distinguish the metarules, the hyperrules, and the
production rules, the production symbol is changed. Instead
of the symbol '->' we use '::' for the metarules and ':' for
the hyperrules. To separate the alternatives of a rule, the
symbol ';' is used instead of '|'. Members are separated by
a blank in metarules and by a comma in hyperrules. The
metanotions have to be chosen carefully, so that no sequence
of metanotions can also be read as a different sequence of
metanotions. For instance, if we have a metanotion X and a
metanotion Y, then a metanotion XY should be avoided, as it
would introduce ambiguity.

To make it clearer, here is a VW grammar which describes the
language $L = \{a^n b^n c^n \mid n \geq 1\}$ (i.e. abc,
aabbcc, aaabbbccc, ...):

N:: iN;i. 
A :: a; b; c. 

S : aN, bN, cN. 
AiN : A symbol, AN. 
Ai : A symbol. 



The first two rules are the metarules, and the last three 
are the hyperrules. The metanotions are N and A. The 
hypernotions are AiN, Ai, A, AN, aN, bN, and cN. 

In the definition of a VW grammar, a member is a ter- 
minal symbol if it ends in symbol (like 'b symbol' for 
the terminal symbol 'b'), otherwise it is a non-terminal. 
So, here the rule "Ai : A symbol." produces the terminal 
symbols a, b and c. 

The metanotion N produces an infinite set of sequences of
i's, which act as a counter for the number of letters to be
produced. Indeed, as we said, a hyperrule describes a
possibly infinite set of production rules. For instance
here, the rule "AiN : A symbol, AN." actually produces the
rules:

aii : a symbol, ai. 
aiii : a symbol, aii. 
etc. . . 

bii : b symbol, bi. 
biii : b symbol, bii. 
etc. . . 

cii : c symbol, ci. 
ciii : c symbol, cii. 
etc. . . 

To obtain these sets, the metanotion A is replaced con- 
sistently by all the words it can generate. Here these are 
a, b and c. So we obtain the following three rules : 

aiN : a symbol, aN. 
biN : b symbol, bN. 
ciN : c symbol, cN. 

Then the same thing is done with the metanotion N. As it
generates the infinite language
$L(N) = \{i^n \mid n \geq 1\}$ (i.e. 'i', 'ii', 'iii', ...),
we obtain the above three infinite sets of production rules.
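
The consistent substitution described above can be sketched in a few lines of Python. This is only an illustration: the language of N is truncated to a finite sample, plain string replacement is used (which is safe here only because metanotions are upper-case and protonotions lower-case), and the helper names expand_hyperrule, derive and word are hypothetical.

```python
from itertools import product

def expand_hyperrule(lhs, rhs, metalanguages):
    """Yield the production rules obtained from one hyperrule by
    replacing every metanotion consistently (Uniform Replacement Rule).
    `metalanguages` maps each metanotion to a FINITE sample of words."""
    names = list(metalanguages)
    for values in product(*(metalanguages[name] for name in names)):
        binding = dict(zip(names, values))

        def substitute(text):
            for name, value in binding.items():
                text = text.replace(name, value)
            return text

        yield substitute(lhs), [substitute(member) for member in rhs]

def derive(notion):
    """Terminal string produced by a notion such as 'aiii'."""
    letter, counter = notion[0], notion[1:]
    if counter == "i":                  # Ai : A symbol.
        return letter
    return letter + derive(letter + counter[1:])  # AiN : A symbol, AN.

def word(n):
    """Word obtained from 'S : aN, bN, cN.' when N is i repeated n times."""
    counter = "i" * n  # the same value of N in all three members
    return "".join(derive(letter + counter) for letter in "abc")

sample = {"A": ["a", "b", "c"], "N": ["i", "ii"]}
for left, members in expand_hyperrule("AiN", ["A symbol", "AN"], sample):
    print(left, ":", ", ".join(members) + ".")
print(word(2))
```

Because the three members of S share a single value of N, the three blocks of letters necessarily have the same length, which is exactly the long-range relation a context-free grammar cannot express.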

4.3 Place in Chomsky's hierarchy 

By construction, Van Wijngaarden grammars do not belong to
any category of Chomsky's classification. However, one can
compare the expressive power of a Van Wijngaarden grammar
with the different types of Chomsky's hierarchy. In terms of
expressive power, they are in fact equivalent to Type 0
grammars. In a sense, they are even more powerful than
Type 0 grammars, since they can handle infinite symbol sets.
For instance, as shown in Figure 4, a Van Wijngaarden
grammar can produce the set:

$S = \{ t_1^n \cdots t_k^n \mid n > 0, k > 0, t_1 \cdots t_k \text{ are different symbols} \}$

N :: nN; e.
C :: i; iC.

S : N i tail.
N C tail : N, C, N C i tail ; e.
N n C : C symbol, N C.
C : e.

Figure 4: A grammar handling an infinite alphabet

A Type 0 grammar cannot generate this set, since its number
of (terminal) symbols is infinite.

Sintzoff [Sin67] showed that there exists a Van Wijngaarden
grammar for every semi-Thue system (see footnote 2). Van
Wijngaarden [Wij74] showed that a Van Wijngaarden grammar
can simulate a Turing machine. Thus, both proved that these
grammars are at least as powerful as Type 0 grammars (i.e.
that they are Turing complete). As a consequence, parsing of
these grammars is undecidable in general. As a side note, if
the first set of rules, i.e. the metarules, does not
generate an infinite language, then the Van Wijngaarden
grammar is equivalent to a standard context-free grammar.
Indeed, if the language generated by a metarule is finite,
one can write as many production rules as there are words in
the language, and the consistent substitution can be
"emulated" by adding rules which produce only one sentence.
For instance, the grammar:

S -> P1 BODY P2 | P3 BODY P4
P1 -> (
P2 -> )
P3 -> <
P4 -> >

ensures that the opening bracket matches the closing one. By
increasing the number of rules of the grammar, we can
express more and more context-sensitive conditions. It
follows that, with an infinite collection of context-free
grammar rules, we can express any number of
context-sensitive conditions, and so achieve full
context-sensitivity. As said at the beginning of this
section, this is the idea behind Van Wijngaarden grammars: a
VW grammar can be seen as the composition of two
context-free grammars. The first context-free grammar
generates a language which can in turn be described by the
second context-free grammar. Nonetheless, as mentioned in
the previous section, it is possible to produce every word
of the languages they may describe.

2 A semi-Thue system is a string rewriting system. It is
equivalent to Chomsky's Type 0 grammars.



4.4 VW grammars and word generation 

Dick Grune [Gru84] wrote a program which can produce all the
sentences of a Van Wijngaarden grammar. The program reads a
grammar from its input, and then the generation of the words
starts. If the input grammar describes an infinite language,
then an infinite number of words will be produced. We
modified some parts of this program in order to implement
our mutation engine, and we have written a VW grammar based
on the x86 instruction set.

It is not possible to generate the words of a Van
Wijngaarden grammar in the same way as those of a
context-free grammar. Indeed, to generate the terminal
productions of a context-free grammar, we start from the
start symbol. Intermediate results of a production
(sentential forms) are stored in a queue. To rewrite a
sentential form, we take the first sentential form in the
queue and search it for a sequence of symbols which matches
the left-hand side of a production rule. If such a match is
found, the sentential form is replaced by all its
alternatives: as many copies are made as there are
alternatives, and each copy is appended to the end of the
queue. If no match is found, the sentential form is a
terminal production.
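
The queue-based procedure for context-free grammars can be sketched as follows. This is a minimal Python illustration, not the program mentioned above: left-hand sides are restricted to a single symbol, the leftmost match is always the one rewritten, and the function name generate and its limit parameter are hypothetical.

```python
from collections import deque

def generate(rules, start, limit):
    """Breadth-first enumeration of the first `limit` terminal
    productions. `rules` maps a non-terminal to its list of
    alternatives, each alternative being a tuple of symbols."""
    queue = deque([(start,)])
    words = []
    while queue and len(words) < limit:
        form = queue.popleft()
        # Look for the leftmost symbol matching a left-hand side.
        index = next((i for i, symbol in enumerate(form)
                      if symbol in rules), None)
        if index is None:
            # No match: the sentential form is a terminal production.
            words.append("".join(form))
            continue
        # One copy per alternative, appended to the end of the queue.
        for alternative in rules[form[index]]:
            queue.append(form[:index] + tuple(alternative)
                         + form[index + 1:])
    return words

# S -> aS | a
print(generate({"S": [("a", "S"), ("a",)]}, "S", 4))
```

The limit is needed because the queue never empties when the language is infinite; the breadth-first order guarantees that every word is nevertheless reached after finitely many steps.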

This process cannot be applied to Van Wijngaarden grammars,
as a single hyperrule may yield an infinite number of
left-hand sides. It would require us to scan all the
possible left-hand sides of the hyperrule, so we might have
to look at an infinite number of left-hand sides to know
whether there is a possible match. In theory this takes an
infinite amount of time, but a solution to this problem can
be found. The main issue comes from the fact that a
metanotion can generate an infinite language (i.e. an
infinite number of words). What we want is to find the
terminal productions of the metanotions in the left-hand
side of the hyperrule such that, after substitution, the
left-hand side corresponds to the sentential form. So we
parse the sentential form according to the "metagrammar",
with the left-hand side of the hyperrule as the starting
form. When the parsing is done, we can deduce which terminal
productions have to be used to match the sentential form. As
the metagrammar is a context-free grammar, it can be parsed
efficiently, so the problem can be solved. With this
mechanism, a member is considered to be a terminal symbol if
it matches no left-hand side of the hyperrules. It is
therefore not necessary to append "symbol" to a member to
make it a terminal symbol.
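
The idea of parsing a sentential form against the left-hand side of a hyperrule can be illustrated with the metarules of the a^n b^n c^n grammar. The Python sketch below is an assumption-laden illustration: it uses naive backtracking instead of an efficient context-free parser, it assumes that the metarules are neither left-recursive nor able to produce the empty word, and the names METARULES, match and derives are hypothetical.

```python
METARULES = {
    "N": [("i", "N"), ("i",)],      # N :: iN; i.
    "A": [("a",), ("b",), ("c",)],  # A :: a; b; c.
}

def match(symbols, text, binding, consistent=True):
    """Yield every binding of metanotions under which the sequence
    `symbols` spells exactly `text`. With consistent=True, a metanotion
    that occurs twice must take the same value (uniform replacement)."""
    if not symbols:
        if text == "":
            yield binding
        return
    head, rest = symbols[0], symbols[1:]
    if head not in METARULES:
        # Protonotion: it must appear literally.
        if text.startswith(head):
            yield from match(rest, text[len(head):], binding, consistent)
        return
    if consistent and head in binding:
        # Already bound: reuse the earlier value.
        value = binding[head]
        if text.startswith(value):
            yield from match(rest, text[len(value):], binding, consistent)
        return
    for cut in range(1, len(text) + 1):
        prefix = text[:cut]
        if derives(head, prefix):
            extended = {**binding, head: prefix} if consistent else binding
            yield from match(rest, text[cut:], extended, consistent)

def derives(metanotion, text):
    """True if `metanotion` can produce `text` (metarules are
    context-free, so no consistency is required inside them)."""
    return any(
        next(match(alternative, text, {}, consistent=False), None) is not None
        for alternative in METARULES[metanotion]
    )

# Match the sentential form 'aiii' against the left-hand side 'AiN'.
print(list(match(("A", "i", "N"), "aiii", {})))
```

The binding found for the left-hand side is then substituted into the right-hand side of the same hyperrule, which yields the single production rule that applies to the sentential form without ever enumerating the infinite rule set.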

Now we know how to produce words from a VW grammar. We also
know that VW grammars can handle context-sensitivity. So we
now want to write rules which transform one sentential form
into another while preserving its semantics (its context
information). To do so, we slightly modified the mechanism
of the grammar: the word we want to transform is used as the
starting word, and we do not try to parse it. In fact, a
sort of parsing is performed by the way the production
process works. Moreover, we use a random generator during
the production process, so that the production can randomly
generate any word of the language described by the grammar.
As an example, take these metarules:

N :: 0; 1; 2; ...; 9; 0N; 1N; ...; 9N.
HEX :: N; a; b; ...; f; a HEX; b HEX; ...; f HEX.
ADR :: 0xN.
NUM :: ADR; HEX.
INST :: mov; push; pop.
REG :: eax; ebx; edx.
STACK :: esp.
REGS :: STACK; REG.
REGNUM :: REGS; NUM.
MEM :: [REGS]; [ADR].
COMMA :: ,.

The metanotion NUM represents an address or a hexadecimal
number. The metanotion INST represents three instructions
(mov, push and pop), and so on. The hyperrules:

mov REGS COMMA REGNUM : 

move REGNUM in REGS. 

push REGNUM : 

save REGNUM. 

pop REGS : 

restore REGS. 

turn an instruction into a readable sentence. For example,
the word "mov eax, 0" will be replaced by "move 0 in eax"
because of the first hyperrule. We can add hyperrules which
transform these sentences into other equivalent
sentence(s):

move REGNUM in MEM :
    mov, MEM, COMMA, REGNUM.

move REGNUM in REGS :
    mov, REGS, COMMA, REGNUM;
    save REGNUM, restore REGS.

save REGNUM :
    push, REGNUM;
    subtract 4 from esp, move REGNUM in [esp].

restore REGS :
    pop, REGS;
    move [ esp ] in REGS, add 4 to esp.



Now the sentential form obtained before ("move 0 in eax")
can be replaced either by "mov, eax, 0" or by "save 0,
restore eax". If the first alternative is selected, the
generation stops. Indeed, the sentential form is composed of
"mov", "eax", "," and "0", and none of these words matches
the left-hand side of a hyperrule. On the other hand, if the
second alternative is selected, the generation continues,
and both parts of the sentential form, "save 0" and "restore
eax", can be replaced independently of each other. Thus, the
sentential form "save 0" can be replaced by "push, 0" (so
the generation stops) or by "subtract 4 from esp, move 0 in
[ esp ]", etc.

The metarules used above can be made more sophisticated so
that they generate an infinite set of instructions, and so
that the hyperrules generate an infinite number of
production rules. Hence we can have an (infinite) rewriting
system handling an infinite number of instructions.

5 K-ary viruses

5.1 What is a K-ary virus

Definition 4 ([Fil07a]). A K-ary virus is composed of a
family of k files (some of which may not be executable),
whose union constitutes a computer virus and performs an
offensive action that is equivalent to that of a true virus.
Such a code is said to be sequential if the k constituent
parts act strictly one after another. It is said to be
parallel if the k parts execute simultaneously.

The interest of combined viruses lies in the fact that the
viral information is split into various parts which, taken
separately, can have non-malicious behaviour. Because of
this separation of the viral information, we are outside the
scope of Cohen's model. His model supposes that a virus is
made of a unique sequence of symbols, which is not the case
with combined viruses.

Two main classes of K-ary viruses have been identified
[Fil07a]:

• Class 1 codes. These are the codes that work sequentially.
This class is composed of three subclasses:

  - Subclass A. Each code refers to or contains a reference
    to the others. Thus, the detection of one of these codes
    leads to the detection of all the others.

  - Subclass B. None of the codes refers to or contains a
    reference to the others. Thus, detecting one code does
    not affect the other codes. The detected code can be
    replaced by another code.

  - Subclass C. The dependence between the codes is
    directed. Thus, detecting one code does not affect the
    codes which come before it in the sequential execution.

• Class 2 codes. These are the codes that work in parallel.
This class is composed of the same three subclasses as
Class 1.

5.2 Van Wijngaarden representation 

The power of a K-ary virus lies in the fact that it is split
into several parts. Thus, one can see a K-ary virus as a
distributed program whose global action is the same as that
of a virus. Looking at this type of program from the point
of view of formal grammars, it seems natural that such a
program can be described by them.

Definition 5. Let $x_1, x_2$ be two files, and
$v \in L(G_v)$ a virus. We define the relation $R_v$ by

$x_1 \, R_v \, x_2 \iff \{\exists \omega \in (x_1 \otimes x_2) \mid \omega \in L(G_v)\}$

The $\otimes$ operator is a selection function whose result
is a set of words over its input. The idea is that it
selects some parts of its inputs to extract a word from
them; if one of the results is in the language generated by
$G_v$, then its inputs form a K-ary virus.

The different parts of a K-ary virus can each be described
separately by a grammar. If we put all these parts together,
we have the description of the virus as a whole. Thus a Van
Wijngaarden grammar can be used to define a K-ary virus. The
starting symbol produces all the parts of the K-ary virus,
then the different parts are recognized by some hyperrules
of the grammar. The consistent substitution allows
information to be shared between the parts while they are
created. As an example, for a K-ary virus with K=3, the
rules would look like:

S : PART1 INFOS, PART2 INFOS, PART3 INFOS
PART1 INFOS :
    VW-Grammar of PART1 knowing INFOS
PART2 INFOS :
    VW-Grammar of PART2 knowing INFOS
PART3 INFOS :
    VW-Grammar of PART3 knowing INFOS

Once the combined virus is produced (that is, once we have
different files that contain the elements of the virus),
each part may mutate on its own. While K-ary malware has
been formally defined [Fil07a] and its detection addressed,
our approach makes it possible to formalize the automatic
generation of K-ary malware while providing a constructive
proof.



6 Conclusion 

Van Wijngaarden grammars are very powerful, and can be
easily understood by a human. The power of these grammars
comes from the two context-free grammars that are jointly
used, coupled with the uniform replacement rule, which
allows context-sensitive conditions to be expressed. It is
thus possible to handle undecidable problems suitable for
designing undetectable malware in a far easier way than by
considering formal grammars of class 0 directly.

K-ary viruses have been defined through the use of a Van
Wijngaarden grammar. The main idea is that the alternatives
of the starting symbol are themselves the starting symbols
of grammars, each describing one part (file) of the virus.
This formal definition yields a constructive method to
generate those codes automatically.